12 research outputs found

    SOTXTSTREAM: Density-based self-organizing clustering of text streams

    A streaming data clustering algorithm is presented that builds upon the density-based self-organizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets.

    K-NEAREST NEIGHBORS DENSITY-BASED CLUSTERING

    Traditional density-based clustering approaches rely on a distance-based parameter to define data connectivity and density. However, an appropriate value of this parameter can be difficult to determine as it is highly dependent on the underlying distribution of the data. In particular, distribution parameters (e.g., variance) affect the scale of inter-group distances; this dependence leads to a well-known inability to simultaneously detect clusters at varying levels of density. In this work, connectivity and density are defined according to the rank-order induced by the distance metric (i.e., invariant to the expected scale of the distances). Connectivity is defined by k-nearest neighbors and density by the number of reverse k-nearest neighbors (i.e., vertex in-degree in the directed k-nearest neighbors graph). Two novel density-based clustering algorithms are proposed, the non-hierarchical RNN-DBSCAN and its hierarchical generalization Hk-DC. The advantage of RNN-DBSCAN is that it requires a single parameter k and is robust to varying levels of cluster density, whereas Hk-DC provides an efficient solution for producing a hierarchical clustering of RNN-DBSCAN solutions over k for a fixed density threshold. Importantly, heuristics are proposed for selecting k and density threshold for RNN-DBSCAN and Hk-DC, along with a method for extracting a flat clustering solution from the hierarchy. Additionally, a cluster-dependent solution for handling noise is proposed.
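The rank-based density measure described in this abstract (the number of reverse k-nearest neighbors, i.e., in-degree in the directed k-nearest neighbors graph) can be illustrated with a minimal sketch. This is not the RNN-DBSCAN or Hk-DC algorithm; it only computes the density counts. The function names `knn_indices` and `reverse_knn_counts`, the Euclidean metric, the tie-breaking by index, and the toy points are all assumptions made for illustration.

```python
import math


def knn_indices(points, k):
    """For each point, return the indices of its k nearest neighbors (self excluded)."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p; break ties by index.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbors.append(others[:k])
    return neighbors


def reverse_knn_counts(points, k):
    """Count how many points list each point among their k nearest neighbors.

    This equals the vertex in-degree in the directed kNN graph.
    """
    counts = [0] * len(points)
    for nbrs in knn_indices(points, k):
        for j in nbrs:
            counts[j] += 1
    return counts


# Hypothetical data: a tight group, a much looser group, and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),        # tight group
    (10.0, 10.0), (12.0, 10.0), (10.0, 12.0),  # loose group (20x larger spacing)
    (50.0, 50.0),                              # isolated point
]
counts = reverse_knn_counts(points, k=2)
print(counts)  # [2, 2, 2, 2, 3, 3, 0]
```

In this toy example the tight and loose groups receive similar counts even though their spacing differs by a factor of 20, while the isolated point has no reverse neighbors. This is the scale invariance the abstract attributes to rank-order definitions of density.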

    Number of clusters.


    ARI performance of SOTXTSTREAM in the presence of concept drift.

    ARI performance box-plots for SOTXTSTREAM with respect to synthetic and non-synthetic random stream orderings. In each run, parameters were set to those listed in Table 2 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0180543#pone.0180543.t002).

    SOTXTSTREAM functions and parameters.


    Pseudo-code for the SOTXTSTREAM algorithm.


    Clustering performance by ARI.


    Clustering parameters.


    Wilcoxon signed-ranks test p-values.


    Parameter analysis of SOSTREAM.

    ARI performance plots for the SOSTREAM algorithm parameters (α (A), k (B), m_thresh (C), λ (D)) on all datasets. In each run, parameters were set to those listed in Table 2 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0180543#pone.0180543.t002), except for the parameter under investigation. Additionally, ARI performance is the average value across 100 random stream orderings.